10/601,741 



MS303962.01/MSFTP438US 



Amendments to the Claims 

This listing of claims will replace all prior versions, and listings, of claims in the 
application: 

Listing of Claims: 

1. (Currently Amended) A spam detection system comprising:

a message parsing component that identifies features relating to at least a portion 
of origination information of a message; and 

a feature pairing component that combines the features into useful pairs, the
features of the pairs are evaluated for consistency with respect to one another to
determine if the message is spam for use in connection with training a machine-learning
filter to facilitate detecting spam.

2. (Currently Amended) The system of claim 1, wherein each pair comprises at least
one of the following:

at least one of a domain name and a host name in a MAIL FROM command;

at least one of a domain name and a host name in a HELO command;

at least one of an IP address and a subnet in a Received from header;

at least one of a domain name and a host name in a Display name;

at least one of a domain name and a host name in a Message From line; and

at least one time zone in a last Received from header.

3. (Currently Amended) The system of claim 2, wherein the domain name is derived 
from the host name. 

4. (Currently Amended) The system of claim 2, wherein the subnet comprises one or 
more IP addresses that share a first number of bits in common. 
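As an illustration only, and not part of the claim language, the notion of a subnet as a group of IP addresses sharing a first number of bits can be sketched with Python's standard `ipaddress` module. The function names and the default of 24 shared bits are assumptions chosen for the example.

```python
import ipaddress

def subnet_feature(ip: str, prefix_bits: int = 24) -> str:
    """Return the subnet containing `ip`, keeping only its first `prefix_bits` bits."""
    # strict=False lets us pass a host address and have the host bits masked off.
    network = ipaddress.ip_network(f"{ip}/{prefix_bits}", strict=False)
    return str(network)

def same_subnet(ip_a: str, ip_b: str, prefix_bits: int = 24) -> bool:
    """True when both addresses share their first `prefix_bits` bits."""
    return subnet_feature(ip_a, prefix_bits) == subnet_feature(ip_b, prefix_bits)
```

Under these assumptions, `subnet_feature("192.168.5.77")` yields `"192.168.5.0/24"`, which could then serve as one member of a feature pair.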



5. (Currently Amended) The system of claim 1, wherein a useful pair is any one of a 
domain name and a host name from a Message From and from a HELO command. 
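As an illustration only, one way to form a pair from a Message From address and a HELO host name is sketched below. Deriving the domain by keeping the last two labels of the host name is a simplifying assumption (it mishandles suffixes such as `co.uk`), and the function names and feature format are hypothetical.

```python
def naive_domain(host: str) -> str:
    """Derive a domain from a host name by keeping its last two labels (an assumption)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])

def pair_feature(message_from: str, helo_host: str) -> dict:
    """Combine the Message From domain and the HELO domain into a single pair feature."""
    from_domain = naive_domain(message_from.rsplit("@", 1)[-1])
    helo_domain = naive_domain(helo_host)
    return {
        "pair": f"from:{from_domain}|helo:{helo_domain}",
        # Whether the two members of the pair agree with one another.
        "consistent": from_domain == helo_domain,
    }
```

A learning filter would more plausibly consume the `pair` string itself as a feature; the `consistent` flag is included only to make the comparison visible.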

6. (Currently Amended) The system of claim 1, wherein a useful pair is a Display
name domain name and host name and a Message From domain name and host name. 

7. (Currently Amended) The system of claim 1, wherein a useful pair is any one of a 
domain name and a host name in a Message From and any one of a Received from IP 
address and subnet. 

8. (Currently Amended) The system of claim 1, wherein a useful pair is a sender's
alleged time zone and a Message From domain name.

9. (Currently Amended) The system of claim 1, wherein a useful pair comprises a
sender's type of mailing software and any one of a domain name, host name and user 
name derived from one of an SMTP command and a message header. 

10. (Currently Amended) The system of claim 1, wherein origination information
comprises SMTP commands, the SMTP commands comprise a HELO command, a
MAIL FROM command, and a DATA command.

11. (Currently Amended) The system of claim 10, wherein the DATA command
comprises a Message From line, sender's alleged time zone, and sender's mailing 
software. 

12. (Original) The system of claim 1, further comprising a component that applies 
one or more heuristics consistently to mail messages to obtain consistent feature pairing.

13. (Currently Amended) A spam detection system comprising:

a character sequencing component that analyzes a portion of a message via 
searching for particular character sequences that are indicative of spam, wherein the 
particular sequences are not restricted to whole words; and 

a feature generating component that generates features relating to the character 
sequences of any length, the features are analyzed to detect at least one of intentional
character substitutions, insertions, or misspellings indicative of spam.

14. (Currently Amended) The system of claim 13, wherein the feature generating
component generates features for each run of characters up to a maximum character run 
length. 

15. (Currently Amended) The system of claim 13, wherein the feature generating 
component generates features for substantially all character sequences up to some length 
n. 
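As an illustration only, generating a feature for every character sequence up to some length n can be sketched as follows. The function name and the default maximum length are assumptions made for the example.

```python
from typing import List

def char_ngrams(text: str, max_n: int = 3) -> List[str]:
    """Return every contiguous character sequence of length 1..max_n in `text`."""
    features = []
    for n in range(1, max_n + 1):
        # Slide a window of width n across the text; whitespace and
        # punctuation are kept, so sequences are not restricted to whole words.
        for start in range(len(text) - n + 1):
            features.append(text[start:start + n])
    return features
```

For example, `char_ngrams("abc", 2)` returns `["a", "b", "c", "ab", "bc"]`.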

16. (Currently Amended) The system of claim 13, wherein the character sequences
comprise at least one of letters, numbers, punctuation, symbols, and characters of foreign
languages.

17. (Currently Amended) The system of claim 13, wherein the particular character
sequences comprise at least one of random letters, symbols, and punctuation as chaff at 
any one of a beginning and end of at least one of a subject line of a message and a 
message body. 

18. (Currently Amended) The system of claim 17, wherein random character 
sequences comprise character n-grams which are indicative of spam-like messages. 

19. (Currently Amended) The system of claim 18, wherein the character n-grams are
located in at least one of From address, subject line, text body, html body, and
attachments.

20. (Currently Amended) The system of claim 18, wherein the character n-grams are
position dependent. 

21. (Currently Amended) The system of claim 13, for use with the messages the
portion of the message comprising at least one of foreign language text, Unicode
character types, and other character types not common to English.

22. (Currently Amended) The system of claim 21, wherein the foreign language text
comprises substantially non-space separated words. 

23. (Currently Amended) The system of claim 22, wherein n-grams are used only for
characters not typically separated by spaces. 

24. (Original) The system of claim 13, further comprising a component that extracts 
character sequences obfuscated by punctuation using a pattern-match technique. 
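As an illustration only, a pattern-match technique for extracting character sequences obfuscated by punctuation might look like the sketch below. The regular expression, which looks for three or more single letters separated by punctuation, is an assumption chosen for the example and will also match ordinary abbreviations such as "U.S.A".

```python
import re
from typing import List

# At least two "letter + punctuation" groups followed by a final letter,
# e.g. "v.i.a.g.r.a" or "f-r-e-e". Underscores and whitespace are not
# treated as punctuation by this pattern.
OBFUSCATED = re.compile(r"(?:[A-Za-z][^\w\s]+){2,}[A-Za-z]")

def extract_obfuscated(text: str) -> List[str]:
    """Return the letter sequences hidden inside punctuation-separated runs."""
    return [re.sub(r"[^A-Za-z]", "", m.group(0)) for m in OBFUSCATED.finditer(text)]
```

The recovered sequences could then be passed to whatever feature generation is applied to ordinary text.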

25. (Currently Amended) A spam detection system comprising: 

a character sequencing component that analyzes a portion of a message via 
searching for instances of a string of random characters that are indicative of the message 
being spam; and

a feature generating component that generates features corresponding to the 
instances of random character strings to facilitate determining an entropy measurement 
for each string, the entropy measurement is used to indicate the message as being spam or 
not spam.

26. (Cancelled). 

27. (Currently Amended) The system of claim 25, wherein the system measures a 
value correlated with entropy.

28. (Currently Amended) The system of claim 27, wherein a high value correlated
with entropy is indicative of spam. 

29. (Currently Amended) The system of claim 28, wherein the value correlated with
entropy is the actual entropy -log2 P(abc...z).

30. (Currently Amended) The system of claim 27, wherein the average entropy of a 
character string is used. 
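As an illustration only, the entropy of a string, taken as the negative base-2 logarithm of the string's probability, and its per-character average can be sketched as below. The claims do not specify how the probability is modeled; this sketch assumes independent characters with add-one smoothing, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable

def train_char_model(corpus: str) -> Callable[[str], float]:
    """Build a smoothed unigram character model from sample text (an assumed model)."""
    counts = Counter(corpus)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen characters
    return lambda ch: (counts.get(ch, 0) + 1) / (total + vocab)

def string_entropy(s: str, prob: Callable[[str], float]) -> float:
    """-log2 P(s), with P(s) approximated as the product of character probabilities."""
    return -sum(math.log2(prob(ch)) for ch in s)

def average_entropy(s: str, prob: Callable[[str], float]) -> float:
    """Entropy per character, so that strings of different lengths are comparable."""
    return string_entropy(s, prob) / len(s) if s else 0.0
```

With a model trained on ordinary text, a run of rare characters would be expected to score a higher average entropy than a common word.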

31. (Currently Amended) The system of claim 25, wherein the string of random
characters is chaff. 

32. (Currently Amended) The system of claim 27, wherein the relative entropy 
compares the entropy measurement at any one of a beginning and end of at least one of a 
subject line and message body with the entropy measurement at a middle of at least one 
of the subject line and message body. 

33. (Currently Amended) A spam detection system comprising: 

a component that analyzes substantially all features of a message header in 
connection with training a machine learning spam filter, the component generates feature
pairs; and 

a spam filter that detects spam based at least in part on a comparison of the feature
pairs.

34. (Currently Amended) The system of claim 33, wherein the features of the
message header comprise at least one of a presence and absence of at least one message 
header type, the message header types comprising X-Priority, mail software, and headers 
line for unsubscribing.

35. (Currently Amended) The system of claim 34, wherein the features of the
message header further comprise content associated with at least one message header 
type. 
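As an illustration only, presence, absence, and content features for message header types can be sketched with Python's standard `email` package. The claims name X-Priority; the use of `X-Mailer` for mail software and `List-Unsubscribe` for the unsubscribe header line is an assumption, as are the function name and feature format.

```python
from email import message_from_string

# X-Priority comes from the claim text; the other two names are assumed stand-ins.
HEADER_TYPES = ("X-Priority", "X-Mailer", "List-Unsubscribe")

def header_features(raw_message: str) -> dict:
    """Return presence/absence and content features for selected header types."""
    msg = message_from_string(raw_message)
    features = {}
    for name in HEADER_TYPES:
        value = msg.get(name)  # header lookup is case-insensitive
        features[f"has:{name}"] = value is not None
        if value is not None:
            # Content associated with the header type becomes its own feature.
            features[f"{name}={value.strip()}"] = True
    return features
```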

36. (Original) The system of claim 33, further comprising: 

a component that analyzes at least a portion of a message for images and related 
image information; and 

a component that generates features relating to any one of the images and related 
image information. 

37. (Currently Amended) The system of claim 36, wherein the image information 
comprises image size, image quantity, location of image, image dimensions, and image 
type. 

38. (Currently Amended) The system of claim 36, wherein the image information 
comprises the presence of a first URL and a second URL such that the image is inside of 
a hyperlink. 

39. (Currently Amended) The system of claim 38, wherein the message comprises a 
tag pattern having the form of <A HREF="the first URL"><IMG SRC="the second 
URL"></A>. 
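As an illustration only, detecting an image placed inside a hyperlink, in the form of the tag pattern recited above, can be sketched with Python's standard `html.parser` module. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Tuple

class LinkedImageFinder(HTMLParser):
    """Collect (link URL, image URL) pairs for images nested inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.href_stack = []  # href values of the currently open <a> tags
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # the parser lowercases tag and attribute names
        if tag == "a":
            self.href_stack.append(attrs.get("href"))
        elif tag == "img" and self.href_stack and self.href_stack[-1] and attrs.get("src"):
            self.pairs.append((self.href_stack[-1], attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            self.href_stack.pop()

def linked_images(html: str) -> List[Tuple[str, str]]:
    finder = LinkedImageFinder()
    finder.feed(html)
    finder.close()
    return finder.pairs
```

Each returned pair holds the first URL (the link target) and the second URL (the image source), which could then be turned into features.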

40. (Currently Amended) The system of claim 36, wherein the features are used in 
connection with training a machine learning filter. 

41. (Original) The system of claim 33, further comprising a component that analyzes
a message for HTML attributes and location of HTML attributes as they appear in a tag 
pattern.

42. (Currently Amended) A method that facilitates generating features for use in
spam detection comprising: 

receiving at least one message; 

parsing at least a portion of a message to generate one or more features; 

combining at least two features into pairs, whereby each pair of features creates at
least one additional feature, the features of each pair coinciding with one another;

using the pairs of features to train a machine learning spam filter regarding 
acceptable or unacceptable pairs; and 

detecting a spam e-mail based at least in part on comparing one or more pairs of 
features in the e-mail to at least one pair in the machine learning spam filter.

43. (Currently Amended) The method of claim 42, wherein the at least a portion of 
the message being parsed corresponds to origination information of the message. 

44. (Currently Amended) The method of claim 42, wherein each pair comprises at 
least one of the following: 

at least one of a domain name and a host name in a MAIL FROM command; 

at least one of a domain name and a host name in a HELO COMMAND; 

at least one of an IP address and a subnet in a Received from header; 

at least one of a domain name and a host name in a Display name; 

at least one of a domain name and a host name in a Message From line; and 

at least one time zone in a last Received from header. 

45. (Currently Amended) The method of claim 44, wherein the domain name is 
derived from the host name. 



46. (Currently Amended) The method of claim 42, wherein the pair of features is a 
Display name domain name and host name and a Message From domain name and host 
name.

47. (Currently Amended) The method of claim 42, wherein a useful pair is any one of
a domain name and a host name from a Message From and from a HELO command. 

48. (Currently Amended) The method of claim 42, wherein the pair of features is any
one of a domain name and a host name in a Message From and any one of a Received 
from IP address and subnet. 

49. (Currently Amended) The method of claim 42, wherein the pair of features is a 
sender's alleged time zone and a Message From domain name. 

50. (Currently Amended) The method of claim 42, wherein the pair of features 
comprises a sender's type of mailing software and any one of a domain name, host name 
and display name derived from one of an SMTP command and a message header. 

51. (Original) The method of claim 42, further comprising selecting one or more most
useful pairs of features to train the machine learning filter. 

52. (Currently Amended) The method of claim 42, further comprising employing the 
machine learning filter after it is trained to detect spam by performing the following the 
detecting a spam e-mail based at least in part on one of: 

receiving new messages; 

generating pairs of features based on origination information in the messages; 
passing the pairs of features through the machine learning filter; and 
obtaining a verdict as to whether at least one pair of features indicates that the 
message is more likely to be spam. 



53. (Currently Amended) A method that facilitates generating features for use in 
spam detection comprising: 

receiving one or more messages; 

walking through at least a portion of the message to create features for each run of 
characters of any run length; and 

training a machine learning filter on spam-indicative features using at least a
portion of the created features, the filter subsequently identifies at least one
spam-indicative feature in a message regardless of whitespace or extraneous characters
in the features of the message.

54. (Original) The method of claim 53, further comprising generating features 
relating to a position of at least one run of characters. 

55. (Currently Amended) The method of claim 54, wherein the position comprises 
any one of a beginning of a message body, an end of a message body, a middle of a 
message body, a beginning of a subject line, an end of a subject line, and a middle of a 
subject line. 

56. (Currently Amended) The method of claim 53, wherein the features are created 
for a run of characters up to length n. 

57. (Currently Amended) The method of claim 53, wherein the features are created 
for sub-lengths of runs of characters. 

58. (Currently Amended) The method of claim 53, wherein the run of characters
comprise character n-grams. 

59. (Original) The method of claim 53, further comprising calculating an entropy of 
one or more run of characters and employing the calculated entropy as a feature in 
connection with training a spam filter.

60. (Currently Amended) The method of claim 59, wherein the entropy is at least one
of high entropy, average entropy, and relative entropy.

61. (Currently Amended) The method of claim 60, wherein the average entropy is the
entropy per character of a particular run of characters. 

62. (Currently Amended) The method of claim 60, wherein the relative entropy is a 
comparison of the entropy of a particular run of characters at a first location relative to 
the entropy of a particular run of characters at a second location of the message. 

63. (Currently Amended) The method of claim 62, wherein the first and second 
locations comprise a beginning of a subject line, a middle of a subject line, and an end of 
a subject line, whereby the first location is not the same as the second location when 
determining the relative entropy for any given run of characters. 

64. (Currently Amended) The method of claim 62, wherein the first and second
locations comprise a beginning of a message, a middle of a message, and an end of a
message, whereby the first location is not the same as the second location when
determining the relative entropy for any given run of characters. 

65. (Original) The method of claim 53, further comprising employing the machine 
learning filter after it is trained to detect spam by performing the following: 

receiving new messages; 

generating features based on at least one of runs of characters and entropy
determinations of runs of characters in the messages;

passing the features through the machine learning filter; and

obtaining a verdict as to whether the features indicate that the message is more
likely to be spam.

66. (Currently Amended) A method that facilitates generating features for use in
spam detection comprising: 

receiving one or more messages; 

analyzing substantially all features of a message header, the features are compared
to determine inconsistencies indicative of spam; and

training a machine learning filter using the analyzed features. 

67. (Original) The method of claim 66, further comprising analyzing substantially all 
features based on image information in the message. 

68. (Original) A computer readable medium comprising the method of claim 42. 

69. (Original) A computer readable medium comprising the method of claim 53. 

70. (Currently Amended) A computer-readable medium having stored thereon the 
following computer executable components: 

a component that identifies features relating to at least a portion of origination 
information of a message; and 

a component that combines the features into useful pairs, the pairs are evaluated
for consistency with respect to one another to determine if the message is spam for use in
connection with training a machine learning filter to facilitate detecting spam.

71. (Currently Amended) The computer readable medium of claim 70, further
comprising: 

a component that analyzes a portion of a message via searching for particular 
character sequences that are indicative of spam, wherein the particular sequences are not 
restricted to whole words; and 

a component that generates features relating to the character sequences of any
length.

72. (Original) The computer readable medium of claim 70, further comprising:

a component that analyzes a portion of a message via searching for instances of a 
string of random characters that are indicative of the message being spam. 

73. (Currently Amended) A system that facilitates generating features for use in spam 
detection comprising: 

[[a]] means for receiving at least one message; 

[[a]] means for parsing at least a portion of a message to generate one or more 
features; 

[[a]] means for combining at least two features into pairs, the pairs are evaluated
against each other for consistency [[,]] whereby each pair of features creates at least one
additional feature, the features of each pair coinciding with one another; and

[[a]] means for using the pairs of features to train a machine learning spam filter. 

74. (Currently Amended) A system that facilitates generating features for use in spam 
detection comprising: 

[[a]] means for receiving one or more messages; 

[[a]] means for walking through at least a portion of the message to create features 
for each run of characters of any run length; and 

[[a]] means for training a machine learning filter on spam indicative features 
using at least a portion of the created features. 

75. (Currently Amended) The system of claim 74, further comprising means for 
calculating an entropy of one or more run of characters and employing the calculated 
entropy as a feature in connection with training a spam filter.